contribution

All group members participated in both assignments and, after discussion, produced this report together.

Assignment 1

question1

#read the data
olive<-read.csv("olive.csv", row.names=1)
#draw the scatter plot,colored by original linoleic values
scatterplot<-ggplot(data=olive)+
  geom_point(aes(x=palmitic,y=oleic,color=linoleic))+
  ggtitle("dependence of Palmitic on Oleic")
scatterplot

#discretize linoleic into four classes and plot
linoleic_intervals<-cut_interval(x=olive$linoleic,n=4)
#draw the scatter plot,colored by discretized linoleic values
scatterplot<-ggplot(data=olive)+
  geom_point(aes(x=palmitic,y=oleic,color=linoleic_intervals))+
  ggtitle("dependence of Palmitic on Oleic")
scatterplot

In the first figure, which uses the continuous linoleic values, it is difficult for the human perception system to distinguish the hues. With the discretized variable, however, it is much easier to recognize which class each observation belongs to.
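The discretization step can be illustrated with a minimal base R sketch. It assumes that cut_interval splits the range of the variable into n bins of equal width, and it reproduces that idea with cut() on hypothetical toy values (not the real olive data).

```r
# Hypothetical toy values standing in for olive$linoleic
x <- c(450, 600, 750, 900, 1050, 1200, 1350, 1470)

# Four equal-width bins need five equally spaced break points
breaks <- seq(min(x), max(x), length.out = 5)

# include.lowest = TRUE keeps the minimum value inside the first bin
bins <- cut(x, breaks = breaks, include.lowest = TRUE)

# Number of observations per class
table(bins)
```

Each class then maps to one clearly separated hue instead of a position on a continuous gradient.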

question2

#change the color based on the figure in question 1
scatterplot<-ggplot(data=olive)+
  geom_point(aes(x=palmitic,y=oleic,color=linoleic_intervals))+
  ggtitle("dependence of Palmitic on Oleic")+
  scale_color_manual(values = c("red", "blue", "green","orange"))
scatterplot

#change the size based on the figure in question 1
scatterplot<-ggplot(data=olive)+
  geom_point(aes(x=palmitic,y=oleic,size=linoleic_intervals))+
  ggtitle("dependence of Palmitic on Oleic")
scatterplot
## Warning: Using size for a discrete variable is not advised.

#map the discretized linoleic classes to line orientation (based on the figure in question 1)
#each class gets its own angle: 0, pi/4, pi/2 and 3*pi/4
angle_olive<-olive%>%mutate(angle=(as.numeric(linoleic_intervals)-1)*pi/4)
scatterplot<-ggplot(data=angle_olive,aes(x=palmitic,y=oleic))+
  geom_point()+
  geom_spoke(aes(angle=angle),radius=45)+
  ggtitle("dependence of Palmitic on Oleic")
scatterplot

The plot using different colors makes the categories easiest to distinguish, followed by the plot using orientation angle. The plot using different sizes is the hardest to read because many data points overlap. This agrees with the perception metrics: color (hue, 10 levels, 3.1 bits), line orientation (3 bits) and size (2.2 bits), which also rank color as the easiest channel to perceive. The number of feature levels we can perceive is about 8, which is equal to 3 bits.
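The relation between bits and distinguishable levels used above can be checked with a short sketch. It assumes the usual information-theoretic reading, where a capacity of b bits corresponds to 2^b reliably distinguishable levels; the variable names are illustrative only.

```r
# Channel capacities quoted in the text, in bits
bits <- c(hue = 3.1, orientation = 3.0, size = 2.2)

# A capacity of b bits corresponds to 2^b distinguishable levels
levels_equiv <- 2^bits

# Rounded for readability; orientation gives exactly 8 levels
round(levels_equiv, 1)
```

Under this reading, size offers the fewest distinguishable levels of the three channels, which is consistent with the size plot being the hardest to read.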

question3

#draw the scatter plot,colored by numeric value of region
scatterplot<-ggplot(data=olive)+
  geom_point(aes(x=oleic,y=eicosenoic,color=Region))+
  ggtitle("dependence of oleic on eicosenoic")
scatterplot

#draw the scatter plot,colored by categorical value of region
scatterplot<-ggplot(data=olive)+
  geom_point(aes(x=oleic,y=eicosenoic,color=as.factor(Region)))+
  ggtitle("dependence of oleic on eicosenoic")+
  labs(color="Region")
scatterplot

In the scatter plot colored by the numeric value of Region, it is somewhat difficult to identify the decision boundaries. Here the Region values are treated as continuous, so all points share the same hue and differ only in brightness. Region should therefore be treated as a categorical variable. In the second plot, where Region is converted to a factor with three categories, the boundaries can be identified quickly. In this latter case preattentive mechanisms make the quick identification possible, and the preattentive feature is hue.

question4

#draw the scatter plot
#colored by categorical value of linoleic
#shape is defined by a discretized Palmitic (3 classes)
#size is defined by a discretized Palmitoleic (3 classes)
linoleic_3intervals<-cut_interval(x=olive$linoleic,n=3)
palmitic_3intervals<-cut_interval(x=olive$palmitic,n=3)
palmitoleic_3intervals<-cut_interval(x=olive$palmitoleic,n=3)
scatterplot<-ggplot(data=olive)+
  geom_point(aes(x=oleic,
                 y=eicosenoic,
                 color=linoleic_3intervals,
                 shape=palmitic_3intervals,
                 size=palmitoleic_3intervals))+
  labs(color="linoleic",shape="palmitic",size="palmitoleic")+
  ggtitle("dependence of oleic on eicosenoic")
scatterplot
## Warning: Using size for a discrete variable is not advised.

This figure tries to display too much information at once. It is hard to tell the categories apart, especially those encoded by size and shape, and many points overlap. The figure shows that combining several perception channels does not add up their individual capacities.

question5

scatterplot<-ggplot(data=olive)+
  geom_point(aes(x=oleic,
                 y=eicosenoic,
                 color=as.factor(Region),
                 shape=palmitic_3intervals,
                 size=palmitoleic_3intervals))+
  labs(color="Region",shape="palmitic",size="palmitoleic")+
  ggtitle("dependence of oleic on eicosenoic")
scatterplot
## Warning: Using size for a discrete variable is not advised.

According to Treisman’s theory, a figure is processed in parallel by checking the individual feature maps, whereas combining features takes additional time. In this case a single preattentive feature, hue, is processed, so the boundaries between regions are easy to see from the colour alone. Reading the colour in conjunction with shape and size, however, requires considerably more effort.

question6

#create a pie chart using plotly
count_area<-olive%>%group_by(Area)%>% summarize( count=n())
proportion<-count_area$count/sum(count_area$count)*100
pie_chart<-plot_ly(data=count_area,labels=~Area,values=~proportion,textinfo = "none")%>%
  add_pie()%>% layout(title = "proportions of oil pie chart",showlegend=FALSE)
pie_chart

Because the labels are hidden and only the hover values are kept, it is somewhat difficult to compare the proportions, especially for areas whose proportions are similar. It is also inconvenient to read off exact values, since we have to hover the cursor over each slice again and again to see its proportion.

question7

# contour plot
contour_plot<-ggplot(olive,aes(x=linoleic, y=eicosenoic))+
  geom_density_2d()
contour_plot

#scatter plot
scatter_plot<-ggplot(olive,aes(x=linoleic, y=eicosenoic))+
  geom_point()
scatter_plot

The scatter plot of the two variables shows that the observations can be divided into two groups. These groups are difficult to see in the contour plot, where the boundary between them is not clear.

Assignment 2

question1

# Read data
q2data <- read_xlsx("baseball-2016.xlsx")
# Check the range of the below variables
print(c(max(q2data$HR),min(q2data$HR)))
## [1] 253 122
print(c(max(q2data$RBI),min(q2data$RBI)))
## [1] 836 575
print(c(max(q2data$OBP),min(q2data$OBP)))
## [1] 0.348 0.299

Here, three variables, HR (Home Runs), RBI (Runs Batted In), and OBP (On Base Percentage), are printed to compare the ranges of different variables. The ranges differ by several orders of magnitude: HR and RBI vary by more than a hundred units, while OBP varies by less than 0.05. Without scaling, the distances would be dominated by the variables with the largest ranges, so it is reasonable to scale the data before performing multidimensional scaling (MDS).
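The effect of scaling on distances can be sketched with the printed minima and maxima. The two "teams" below are hypothetical rows built from those extremes, not real observations, and the helper sq_diff is introduced only for this illustration.

```r
# Two hypothetical teams built from the printed maxima and minima
teams <- rbind(A = c(HR = 253, RBI = 836, OBP = 0.348),
               B = c(HR = 122, RBI = 575, OBP = 0.299))

# Squared difference per variable between the two rows
sq_diff <- function(m) (m[1, ] - m[2, ])^2

# Share of the squared Euclidean distance contributed by each variable
share_raw <- sq_diff(teams) / sum(sq_diff(teams))

# Same shares after centering and scaling every column
scaled <- scale(teams)
share_scaled <- sq_diff(scaled) / sum(sq_diff(scaled))

signif(share_raw, 2)
round(share_scaled, 3)
```

On the raw values OBP contributes almost nothing to the distance, while after scaling every variable contributes equally in this two-row example.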

question2

#Using code template from course website
q2data.numeric= scale(q2data[,3:28])
d = dist(q2data.numeric, method = "minkowski")
res=isoMDS(d,k=2)
## initial  value 19.856833 
## iter   5 value 16.319153
## iter  10 value 16.046215
## final  value 15.935476 
## converged
coords=res$points

q2MDS=as.data.frame(coords)
q2MDS$League=q2data$League
q2MDS$Team=q2data$Team

plot_22 <- plot_ly( data = q2MDS, x = q2MDS[,1], y = q2MDS[,2],color =q2MDS$League, 
         colors = c("red","black") ,type = "scatter", mode = "markers", text = q2MDS$Team)
plot_22

In the scatter plot, the majority of the AL teams lie within the range -2.6 to 3.6 on the x-axis and above -1.26 on the y-axis, while only 3 NL teams fall in this region. The y-axis (second MDS component) differentiates the two leagues best, since both leagues have a similar spread along the x-axis. Because of the argument text = q2MDS$Team, the team name of each data point can be checked by hovering. The Boston Red Sox appear to be an outlier in this context, as they are the only AL team outside the aforementioned range.

question3

#Using code template from course website
sh <- Shepard(d, coords)
delta <-as.numeric(d)
D<- as.numeric(dist(coords))

n=nrow(coords)
index=matrix(1:n, nrow=n, ncol=n)
index1=as.numeric(index[lower.tri(index)])

n=nrow(coords)
index=matrix(1:n, nrow=n, ncol=n, byrow = T)
index2=as.numeric(index[lower.tri(index)])

plot_ly()%>%
  add_markers(x=~delta, y=~D, hoverinfo = 'text',
              text = ~paste('Obj1: ', rownames(q2data)[index1],
                            '<br> Obj 2: ', rownames(q2data)[index2]))%>%
  #if nonmetric MDS is involved
  add_lines(x=~sh$x, y=~sh$yf)
#Check stress of MDS
print(res$stress)
## [1] 15.93548

There are two ways to check how successful the MDS was. The first is to inspect the Shepard plot directly: if the fitted line is close to a straight diagonal and the dots lie close to it, the MDS can be considered successful. In the above case the line is not a straight diagonal and the dots are scattered around it, so the MDS performance is not ideal. The second is to check the stress (the goodness of fit, calculated from the residuals around the line), where lower is better. Here the stress is 15.9354757 (isoMDS reports it in percent), which supports the conclusion that the MDS does not perform ideally.
By hovering over the dots that lie farthest from the line, two pairs of data points are found, <Obj1: 20 Obj2: 16> and <Obj1: 17 Obj2: 1>, which represent <Oakland Athletics, Milwaukee Brewers> and <Minnesota Twins, Arizona Diamondbacks> respectively. These two pairs are hard for the MDS to map successfully.
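How the stress follows from the residuals around the Shepard line can be sketched as below. This assumes the Kruskal stress-1 definition expressed in percent, which is our understanding of what isoMDS reports; the function name kruskal_stress and the toy vectors are hypothetical.

```r
# Kruskal stress-1 in percent (assumed to match the isoMDS definition):
# D  = distances between points in the MDS configuration
# yf = values of the fitted monotone line for the same pairs
kruskal_stress <- function(D, yf) {
  100 * sqrt(sum((D - yf)^2) / sum(D^2))
}

# Hypothetical toy values: two pairs deviate from the fitted line
D_toy  <- c(1, 2, 3, 4)
yf_toy <- c(1, 2.5, 2.5, 4)

stress_toy <- kruskal_stress(D_toy, yf_toy)
stress_perfect <- kruskal_stress(D_toy, D_toy)  # no residuals, stress 0

round(stress_toy, 2)
```

With the objects from this report, the corresponding call would presumably be kruskal_stress(sh$y, sh$yf), since Shepard returns the configuration distances in y and the fitted values in yf.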

question4

Q24df <- cbind(q2data,res$points[,2])
Q24_plot_function <- function(i){
  ggplot(data = Q24df)+geom_point(aes(x=Q24df[, 29],y=Q24df[,i]))+  
    xlab("MDS variable 2") +  
    ylab("") +  
    ggtitle(colnames(Q24df[i]))
}

plot_list <- lapply(seq(3,28,1),Q24_plot_function)
grid_plot <- grid.arrange(grobs=plot_list,
                          top=("MDS variable 2 against all other numerical variables"),ncol=4)

Among the scatter plots of the second MDS variable against the numerical variables, the two variables with the strongest connection are HR.per.game (Home Runs per Game) and HR (Home Runs); both relationships are positive.
These two variables are related to each other, and both are important statistics for scoring baseball teams. A home run is a hit that allows the batter to make a complete circuit of the bases and score a run, so it guarantees at least one run for the team and sometimes more. Hence, a team with higher HR.per.game and HR can be considered a team with higher scoring potential and a better team.
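The visual comparison could be complemented by ranking the variables by their absolute correlation with the second MDS coordinate. The sketch below uses a hypothetical toy data frame (the names mds2 and stats and all values are made up for illustration) rather than the baseball data.

```r
# Hypothetical stand-in for the second MDS coordinate of 30 teams
mds2 <- seq(-2, 2, length.out = 30)

# Hypothetical team statistics: HR follows mds2 closely, OBP does not
stats <- data.frame(
  HR  = 180 + 25 * mds2 + 5 * sin(1:30),
  OBP = 0.32 + 0.01 * cos(2 * (1:30))
)

# Pearson correlation of every variable with the MDS coordinate
cors <- sapply(stats, function(v) cor(v, mds2))

# Variables ordered by strength of the linear relationship
ranking <- sort(abs(cors), decreasing = TRUE)
ranking
```

With the report's data, the same idea would apply to columns 3 to 28 of Q24df against column 29, although correlation only captures linear relationships.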

Appendix

knitr::opts_chunk$set(echo = TRUE)
rm(list = ls())
library(ggplot2)
library(dplyr)
library(plotly)
library(readxl)
library(MASS)
library(gridExtra)
#read the data
olive<-read.csv("olive.csv", row.names=1)
#draw the scatter plot,colored by original linoleic values
scatterplot<-ggplot(data=olive)+
  geom_point(aes(x=palmitic,y=oleic,color=linoleic))+
  ggtitle("dependence of Palmitic on Oleic")
scatterplot

#discretize linoleic into four classes and plot
linoleic_intervals<-cut_interval(x=olive$linoleic,n=4)
#draw the scatter plot,colored by discretized linoleic values
scatterplot<-ggplot(data=olive)+
  geom_point(aes(x=palmitic,y=oleic,color=linoleic_intervals))+
  ggtitle("dependence of Palmitic on Oleic")
scatterplot
#change the color based on the figure in question 1
scatterplot<-ggplot(data=olive)+
  geom_point(aes(x=palmitic,y=oleic,color=linoleic_intervals))+
  ggtitle("dependence of Palmitic on Oleic")+
  scale_color_manual(values = c("red", "blue", "green","orange"))
scatterplot

#change the size based on the figure in question 1
scatterplot<-ggplot(data=olive)+
  geom_point(aes(x=palmitic,y=oleic,size=linoleic_intervals))+
  ggtitle("dependence of Palmitic on Oleic")
scatterplot

#map the discretized linoleic classes to line orientation (based on the figure in question 1)
#each class gets its own angle: 0, pi/4, pi/2 and 3*pi/4
angle_olive<-olive%>%mutate(angle=(as.numeric(linoleic_intervals)-1)*pi/4)
scatterplot<-ggplot(data=angle_olive,aes(x=palmitic,y=oleic))+
  geom_point()+
  geom_spoke(aes(angle=angle),radius=45)+
  ggtitle("dependence of Palmitic on Oleic")
scatterplot
#draw the scatter plot,colored by numeric value of region
scatterplot<-ggplot(data=olive)+
  geom_point(aes(x=oleic,y=eicosenoic,color=Region))+
  ggtitle("dependence of oleic on eicosenoic")
scatterplot

#draw the scatter plot,colored by categorical value of region
scatterplot<-ggplot(data=olive)+
  geom_point(aes(x=oleic,y=eicosenoic,color=as.factor(Region)))+
  ggtitle("dependence of oleic on eicosenoic")+
  labs(color="Region")
scatterplot
#draw the scatter plot
#colored by categorical value of linoleic
#shape is defined by a discretized Palmitic (3 classes)
#size is defined by a discretized Palmitoleic (3 classes)
linoleic_3intervals<-cut_interval(x=olive$linoleic,n=3)
palmitic_3intervals<-cut_interval(x=olive$palmitic,n=3)
palmitoleic_3intervals<-cut_interval(x=olive$palmitoleic,n=3)
scatterplot<-ggplot(data=olive)+
  geom_point(aes(x=oleic,
                 y=eicosenoic,
                 color=linoleic_3intervals,
                 shape=palmitic_3intervals,
                 size=palmitoleic_3intervals))+
  labs(color="linoleic",shape="palmitic",size="palmitoleic")+
  ggtitle("dependence of oleic on eicosenoic")
scatterplot
scatterplot<-ggplot(data=olive)+
  geom_point(aes(x=oleic,
                 y=eicosenoic,
                 color=as.factor(Region),
                 shape=palmitic_3intervals,
                 size=palmitoleic_3intervals))+
  labs(color="Region",shape="palmitic",size="palmitoleic")+
  ggtitle("dependence of oleic on eicosenoic")
scatterplot
#create a pie chart using plotly
count_area<-olive%>%group_by(Area)%>% summarize( count=n())
proportion<-count_area$count/sum(count_area$count)*100
pie_chart<-plot_ly(data=count_area,labels=~Area,values=~proportion,textinfo = "none")%>%
  add_pie()%>% layout(title = "proportions of oil pie chart",showlegend=FALSE)
pie_chart

# contour plot
contour_plot<-ggplot(olive,aes(x=linoleic, y=eicosenoic))+
  geom_density_2d()
contour_plot

#scatter plot
scatter_plot<-ggplot(olive,aes(x=linoleic, y=eicosenoic))+
  geom_point()
scatter_plot
# Read data
q2data <- read_xlsx("baseball-2016.xlsx")
# Check the range of the below variables
print(c(max(q2data$HR),min(q2data$HR)))
print(c(max(q2data$RBI),min(q2data$RBI)))
print(c(max(q2data$OBP),min(q2data$OBP)))
#Using code template from course website
q2data.numeric= scale(q2data[,3:28])
d = dist(q2data.numeric, method = "minkowski")
res=isoMDS(d,k=2)
coords=res$points

q2MDS=as.data.frame(coords)
q2MDS$League=q2data$League
q2MDS$Team=q2data$Team

plot_22 <- plot_ly( data = q2MDS, x = q2MDS[,1], y = q2MDS[,2],color =q2MDS$League, 
         colors = c("red","black") ,type = "scatter", mode = "markers", text = q2MDS$Team)
plot_22
#Using code template from course website
sh <- Shepard(d, coords)
delta <-as.numeric(d)
D<- as.numeric(dist(coords))

n=nrow(coords)
index=matrix(1:n, nrow=n, ncol=n)
index1=as.numeric(index[lower.tri(index)])

n=nrow(coords)
index=matrix(1:n, nrow=n, ncol=n, byrow = T)
index2=as.numeric(index[lower.tri(index)])

plot_ly()%>%
  add_markers(x=~delta, y=~D, hoverinfo = 'text',
              text = ~paste('Obj1: ', rownames(q2data)[index1],
                            '<br> Obj 2: ', rownames(q2data)[index2]))%>%
  #if nonmetric MDS is involved
  add_lines(x=~sh$x, y=~sh$yf)
#Check stress of MDS
print(res$stress)
Q24df <- cbind(q2data,res$points[,2])
Q24_plot_function <- function(i){
  ggplot(data = Q24df)+geom_point(aes(x=Q24df[, 29],y=Q24df[,i]))+  
    xlab("MDS variable 2") +  
    ylab("") +  
    ggtitle(colnames(Q24df[i]))
}

plot_list <- lapply(seq(3,28,1),Q24_plot_function)
grid_plot <- grid.arrange(grobs=plot_list,
                          top=("MDS variable 2 against all other numerical variables"),ncol=4)